
Materials and instruments

4-Hydroxybenzaldehyde, 4-(2-chloroethyl)morpholine, 2-tetralone, ammonium acetate, 4-bromo-N,N-diphenylaniline, (5-formylthiophen-2-yl)boronic acid, PdCl2(dppf), 1,1,2-trimethyl-1H-benzo[e]indole, 2-(4-nitrophenyl)acetonitrile, iodoethane and solvents were all purchased from Sigma Aldrich and used as received without further purification. Chloroform and ethanol were obtained from Macklin reagent. 1,2-Distearoyl-sn-glycero-3-phosphoethanolamine-N-[methoxy(polyethylene glycol)-2000] (DSPE-mPEG2000) was purchased from Xi'an ruixi Biological Technology Co., Ltd. PBS (pH 7.4) was purchased from Beyotime Biotechnology, and Lyso-Tracker Green was purchased from Sigma Aldrich. DMEM medium, fetal bovine serum (FBS), penicillin and streptomycin were purchased from Gibco. 3-Ethyl-1,1,2-trimethyl-1H-benzo[e]indol-3-ium iodide was synthesized according to the literature method[49].

1H and 13C NMR spectra were recorded on a Bruker ARX 400 NMR spectrometer using CDCl3 or DMSO-d6 as solvent. Liquid chromatography-mass spectrometry (LC-MS) was performed on a Thermo Scientific LCQ Fleet. High-resolution mass spectra (HRMS) were recorded on a XEVO G2-XS QTOF mass spectrometer operating in matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mode. UV–vis absorption spectra were measured on a PerkinElmer Lambda 950 spectrophotometer. Photoluminescence (PL) spectra were recorded on an Edinburgh FS5 fluorescence spectrophotometer. Particle size and zeta potential were measured with a Malvern Zetasizer Nano-ZS90. Particle size and morphology were observed on a HITACHI HT7700 transmission electron microscope (TEM). Confocal laser scanning microscopy (CLSM) images were collected on a ZEISS LSM880 microscope. The chemical structures of the final products were confirmed by NMR and mass spectra (Figures S1–S11, Supporting Information).

Synthesis of PM[50]

To a solution of 4-hydroxybenzaldehyde (1.0 g, 8.20 mmol) in acetonitrile were added 4-(2-chloroethyl)morpholine (1.2 g, 8.20 mmol) and anhydrous potassium carbonate (1.68 g, 12.3 mmol), and the mixture was refluxed for 10 h. The mixture was filtered and concentrated to give the crude product, which was purified by column chromatography (DCM/MeOH = 50:1) to yield the desired compound as a brown oil (1.39 g, yield 72%). 1H NMR (500 MHz, CDCl3) δ 9.88 (s, 1H), 7.90–7.77 (m, 2H), 7.09–6.94 (m, 2H), 4.20 (t, J = 5.7 Hz, 2H), 3.80–3.67 (m, 4H), 2.84 (t, J = 5.6 Hz, 2H), 2.67–2.49 (m, 4H).

Synthesis of PTMM[51]

In a 50 mL round-bottomed flask, ammonium acetate (0.77 g, 10 mmol), PM (0.24 g, 1.0 mmol), 2-tetralone (0.30 g, 2.0 mmol) and 10 mL of glacial acetic acid were combined and stirred for 24 h at room temperature. After completion of the reaction, as monitored by TLC, the mixture was poured into water. The resulting solid was collected by filtration and purified by column chromatography with PE/EA (10:1) as eluent to give a light green solid (0.42 g, yield: 85%). 1H NMR (400 MHz, CDCl3) δ 7.58–7.50 (m, 1H), 7.46–7.36 (m, 3H), 7.35–7.31 (m, 2H), 7.37–7.29 (m, 1H), 7.28 (s, 1H), 7.03–6.93 (m, 2H), 6.94–6.88 (m, 2H), 4.20 (t, J = 5.8 Hz, 2H), 3.81 (t, J = 4.4 Hz, 4H), 3.21–3.09 (m, 4H), 3.03–2.95 (m, 2H), 2.88 (s, 2H), 2.80 (t, J = 6.5 Hz, 2H), 2.66 (s, 4H). 13C NMR (101 MHz, CDCl3) δ 158.48, 158.13, 153.52, 145.75, 138.72, 133.28, 131.18, 129.54, 128.70, 127.89, 127.61, 127.22, 126.91, 126.07, 125.72, 114.56, 66.87, 65.79, 57.69, 54.15, 33.27, 29.62, 29.55, 29.32. LC-MS: m/z calcd 488.246, found 489.521 [M + H]+; retention time = 0.881 min.

Synthesis of TTA[52]

A solution of 4-bromo-N,N-diphenylaniline (1.0 g, 3.0 mmol) and (5-formylthiophen-2-yl)boronic acid (0.63 g, 4.0 mmol) in a toluene/MeOH mixture (20 mL/20 mL) was refluxed under nitrogen in the presence of PdCl2(dppf) (0.23 g, 0.31 mmol) and K2CO3 (2.13 g, 15.4 mmol) for 24 h. The combined organic phase was filtered and dried to obtain the crude product, which was further purified by silica-gel chromatography (PE/DCM = 2:1) to obtain a yellow solid (0.74 g, yield: 69.2%). 1H NMR (400 MHz, CDCl3) δ 9.88 (s, 1H), 7.73 (d, J = 3.9 Hz, 1H), 7.57–7.52 (m, 2H), 7.35–7.29 (m, 5H), 7.19–7.14 (m, 4H), 7.14–7.06 (m, 4H).

Synthesis of TTNA[53]

A solution of 5-(4-(diphenylamino)phenyl)thiophene-2-carbaldehyde (0.177 g, 0.5 mmol) and 2-(4-nitrophenyl)acetonitrile (0.810 g, 0.5 mmol) in ethanol (20 mL) with a drop of piperidine was refluxed for 5 h. The mixture was then cooled to room temperature, giving a black product that was filtered, washed three times with cold ethanol, and dried under vacuum (0.167 g, 67%). 1H NMR (500 MHz, CDCl3) δ 8.32–8.26 (m, 2H), 7.85–7.75 (m, 3H), 7.65 (d, J = 4.0 Hz, 1H), 7.57–7.52 (m, 2H), 7.34–7.26 (m, 5H), 7.17–7.11 (m, 4H), 7.11–7.03 (m, 4H). 13C NMR (126 MHz, CDCl3) δ 151.92, 149.07, 147.42, 147.12, 140.66, 137.34, 136.64, 135.48, 129.62, 127.28, 126.28, 126.10, 125.27, 124.55, 123.99, 122.99, 122.61, 117.84, 103.97. HRMS (MALDI-TOF): m/z calcd for C31H21N3O2S [M]+, 499.1354; found, 499.1354.

Synthesis of TTBI

TTBI was prepared following a reported procedure[52]. A solution of 5-(4-(diphenylamino)phenyl)thiophene-2-carbaldehyde (0.1 g, 0.3 mmol) and 3-ethyl-1,1,2-trimethyl-1H-benzo[e]indol-3-ium iodide (0.13 g, 0.36 mmol) in dry ethanol, catalyzed by a few drops of piperidine, was refluxed for 10 h under nitrogen. After cooling to room temperature, the solvent was evaporated under reduced pressure. The residue was purified by silica-gel chromatography (DCM/MeOH = 20:1) to give a purple-black solid (0.19 g, yield: 88.4%). 1H NMR (400 MHz, CDCl3) δ 8.70 (d, J = 15.5 Hz, 1H), 8.51 (d, J = 4.1 Hz, 1H), 8.25 (d, J = 8.4 Hz, 1H), 8.15–8.05 (m, 3H), 7.85–7.62 (m, 3H), 7.59 (d, J = 8.4 Hz, 2H), 7.45 (s, 1H), 7.37–7.32 (m, 4H), 7.17 (t, J = 7.6 Hz, 6H), 7.07 (d, J = 7.2 Hz, 2H), 4.90 (m, J = 7.5 Hz, 2H), 2.17 (s, 6H), 1.48 (t, J = 7.3 Hz, 3H). 13C NMR (101 MHz, CDCl3) δ 180.19, 157.17, 149.94, 145.76, 142.12, 137.84, 133.42, 131.70, 130.29, 129.64, 128.57, 127.69, 125.60, 124.43, 122.78, 121.67, 111.74, 107.20, 46.27, 27.18, 22.69. LC-MS: m/z calcd 575.252, found 575.552 [M]+; retention time = 0.979 min.

Synthesis of NPs

NPs were fabricated by injecting a THF solution (0.5 mL) of the AIEgen (1 mg) and DSPE-mPEG2000 (5 mg) into 5 mL of ultrapure water under vigorous stirring for 2 min. The prepared NPs were purified by dialysis against ultrapure water for one day (molecular weight cutoff of 100 kDa) and then concentrated by ultrafiltration for 20 min at 4400 rpm through ultrafiltration tubes with a molecular weight cutoff of 100 kDa. After ultrafiltration, the NPs were dispersed in 1× PBS buffer (pH 7.4) and stored at 4 °C protected from light.

Cell culture and imaging

HeLa cells were cultured in DMEM medium containing 10% FBS at 37 °C in a 5% CO2 atmosphere. After incubating HeLa cells with NPs (20 µg/mL) in glass-bottom dishes for 4 h, 200 nM Lyso-Tracker Green was added and incubated for 30 min. The dishes were then washed three times with PBS and imaged immediately by CLSM.

Molecular descriptors

Molecular descriptors encode molecules and extract structural information, and their computation is a key step in molecular machine learning. Quantitative structure-activity relationship (QSAR) modeling is a central tool in chemometrics: it uses mathematical and statistical methods to relate a compound's activity or physicochemical characteristics to its molecular structure. The foundation of a QSAR study is the calculation of molecular descriptors, and the precise definition and sensible application of these descriptors largely determine whether models with high confidence and validity can be obtained. A molecular descriptor quantifies a particular aspect of a molecule, such as a physicochemical property or a numerical index derived from the molecular structure by some algorithm. More than 5000 molecular descriptors are currently available across a variety of software packages. In this work, RDKit was used to generate molecular descriptors as numerical inputs for the prediction experiments. Molecular descriptors fall into two types, quantitative and qualitative. Quantitative descriptors are based on molecular graph theory, theoretical or experimental spectral data (e.g., UV spectra), molecular composition (e.g., number of hydrogen-bond donors, number of chemical bonds), physicochemical properties (e.g., octanol–water partition coefficients), molecular fields, and molecular shape. Qualitative descriptors are generally referred to as molecular fingerprints, i.e., codes that represent a molecule's structure, properties, fragments, or substructures. All molecular descriptors were generated with RDKit (http://www.rdkit.org).

Quantitative descriptors

Depending on the dimensionality of the molecular structure required for their computation, quantitative descriptors can be categorized as one-dimensional, two-dimensional, three-dimensional, and so on. RDKit offers a variety of methods for computing such descriptors, which can be applied to molecular screening, drug toxicity testing, and other tasks. Herein, 196 one- and two-dimensional descriptors (106 one-dimensional and 90 two-dimensional) were screened to quantify molecular features.
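
As a minimal sketch (assuming the RDKit Python API; the exact 196-descriptor subset used here is not reproduced), built-in 1D/2D descriptors can be computed as follows:

```python
# Sketch: computing a few RDKit 1D/2D descriptors for a SMILES string; the
# 196-descriptor subset actually used in this work is not reproduced here.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("Oc1ccccc1")   # phenol as a toy example

features = {
    "MolWt": Descriptors.MolWt(mol),            # molecular weight (1D)
    "MolLogP": Descriptors.MolLogP(mol),        # octanol-water partition estimate
    "TPSA": Descriptors.TPSA(mol),              # topological polar surface area (2D)
    "NumHDonors": Descriptors.NumHDonors(mol),
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),
}

# The full built-in descriptor set is exposed as (name, function) pairs.
all_features = {name: fn(mol) for name, fn in Descriptors._descList}
print(len(all_features), features)
```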

Qualitative descriptors

Qualitative molecular descriptors are also known as molecular fingerprints. One of the main difficulties in comparing two compounds is the sheer complexity of the task, so a certain degree of simplification or abstraction is required to make the comparison computationally tractable. A molecular fingerprint is an abstract representation that converts (encodes) a molecule into a bit string (also called a bit vector) that can easily be compared across molecules. A typical procedure extracts a molecule's structural features and then hashes them to produce the bit vector. Comparing molecules directly is hard, comparing bit vectors is easy, and comparisons between molecules must be quantifiable. Each bit of a molecular fingerprint corresponds to a molecular fragment. Molecular fingerprints are classified into several types according to how the molecular representation is converted into a bit vector; common methods include the Morgan circular fingerprint, the Daylight fingerprint, the topological torsion fingerprint, and the atom-pair fingerprint.

The extended connectivity fingerprint (ECFP) is a circular topological fingerprint designed for molecular characterization, similarity searching, and structure-activity modeling. Morgan fingerprints (MCP), a family of ECFPs derived from Morgan's algorithm, have become the industry-standard circular molecular fingerprints and were designed explicitly for structure-activity studies; they are often used in ML as a benchmark for comparing the performance of new strategies. In use, a defined diameter is first set (different diameters produce different fingerprints), the Morgan search algorithm then enumerates all substructures in the molecule up to that diameter, and finally each substructure is hashed to obtain the bits that form the fingerprint. ECFPs with small diameters are typically appropriate for similarity searching and molecular clustering, whereas ECFPs with large diameters capture more structural information and are therefore better suited to ML activity prediction.
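
A minimal sketch of generating Morgan bit-vector fingerprints and comparing them with RDKit; the molecules, radius, and bit length shown are illustrative choices, not necessarily those used in this work:

```python
# Sketch: Morgan (ECFP-type) bit-vector fingerprints and a Tanimoto comparison
# with RDKit; radius 2 corresponds to a diameter-4 (ECFP4-like) fingerprint.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol_a = Chem.MolFromSmiles("Cc1ccccc1O")   # o-cresol, toy example
mol_b = Chem.MolFromSmiles("Cc1ccccc1N")   # o-toluidine, toy example

fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

# Bit vectors are easy to compare; Tanimoto similarity is the standard measure.
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))
```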

Topological or path-based fingerprints start from an atom, enumerate each substructure along a path up to a specified length, and then hash each substructure to obtain the fingerprint. Such fingerprints can be tuned for fast substructure searching and molecular filtering and can be applied to any molecule. The best-known example is the Daylight fingerprint, which can be up to 2048 bits long and encodes every possible linkage pathway in a molecule up to a given length. The atom-pair fingerprint describes each pair of atoms in terms of its local environment and the shortest path between the two atoms, while topological torsion fingerprints are built from a topological dihedral descriptor over bonded paths of four non-hydrogen atoms. Both fingerprints can be expressed in sparse form. RDKit implementations of these fingerprint families are sketched below.
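
A brief sketch of the fingerprint families mentioned above, using RDKit with illustrative parameters:

```python
# Sketch: path-based, atom-pair and topological-torsion fingerprints in RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as a toy example

# Path-based (Daylight-like) topological fingerprint.
rdk_fp = Chem.RDKFingerprint(mol, fpSize=2048)

# Atom-pair and topological-torsion fingerprints in hashed bit-vector form.
ap_fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=2048)
tt_fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=2048)

print(rdk_fp.GetNumOnBits(), ap_fp.GetNumOnBits(), tt_fp.GetNumOnBits())
```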

Machine learning models

Random Forest (RF)

RF is a general-purpose ensemble learning algorithm that uses the Classification and Regression Tree (CART) algorithm and reaches its final prediction by aggregating the results of many fully grown trees, each constructed on a randomly chosen subset of the data. As each tree grows, the splitting variable is selected to reduce the Gini impurity

$$\begin{array}{c}{I}_{G}\left(p\right)=1-\sum _{i=1}^{J}{p}_{i}^{2}\end{array}$$ (1)

so as to lessen the chance that a randomly chosen item would be incorrectly classified. Here, J is the total number of classes and pi is the probability that a given item belongs to class i. To make the overall algorithm more predictive than a single tree and more robust on noisy data, RF uses bootstrap sampling of the data and random selection of input features, so that the trees in the forest are distinct and decorrelated. The accuracy of the algorithm generally increases with the number of trees.
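
A minimal sketch of a random-forest regressor using scikit-learn on synthetic data shaped like a descriptor matrix; the hyperparameters are illustrative, not those tuned in this work:

```python
# Sketch: scikit-learn random forest on an (n_samples x n_features) matrix X
# and target vector y; data and hyperparameters are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))                        # e.g., 196 RDKit descriptors per molecule
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)    # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0)   # bootstrap sampling is on by default
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))                            # R^2 on the held-out split
```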

Gradient boosting regression tree (GBRT)

GBRT is a popular model that performs exceptionally well in ML applications and is a representative algorithm of the boosting family. Boosting is a progressive model-combination strategy in which each new regressor improves on the predictions of the previous one; it is therefore a bias-reducing way of combining models. GBRT is an iterative regression-tree algorithm consisting of multiple trees combined by gradient boosting, and the final result is the sum of the outputs of all trees. Intuitively, each round of prediction leaves residuals with respect to the true values, the next round is fitted to those residuals, and the final prediction is obtained by summing all rounds. Training proceeds over several iterations, each producing a weak regressor fitted to the residuals of the previous one, so the accuracy of the overall regressor improves continually. The requirements on the weak regressors are modest: low variance and high bias. A Classification and Regression Tree (CART) is usually chosen as the weak regressor, with limited depth to satisfy the high-bias, low-complexity requirement. The final regressor is a weighted sum of the weak regressors from each training round. With a regression tree as the base model, GBRT can be expressed as:

$$\begin{array}{c}{f}_{M}\left(X\right)=\sum _{m=1}^{M}T\left(X;{{\Theta }}_{m}\right)\end{array}$$ (2)

where \(T\left(X;{{\Theta }}_{m}\right)\) represents the regression tree and M is the number of trees. The forward stagewise algorithm is adopted: the initial boosting tree is first set to \({f}_{0}\left(X\right)=0\), and the model at step m is then given by

$$\begin{array}{c}{\widehat{{\Theta }}}_{m}=arg\underset{{{\Theta }}_{m}}{\text{min}}\sum _{i=1}^{N}L({y}_{i},{f}_{\left(m-1\right)}\left({X}_{i}\right)+T({X}_{i};{{\Theta }}_{m}))\end{array}$$ (3)

where L(·) is the loss function; for regression, the mean squared error or the absolute error is typically used.
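
A minimal scikit-learn sketch of this boosting scheme (assuming scikit-learn ≥ 1.0 for the loss name); hyperparameters are illustrative:

```python
# Sketch: gradient boosting regression; each shallow CART is fitted to the
# residuals of the current ensemble (squared-error loss).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))                        # descriptor matrix
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)    # synthetic target

gbrt = GradientBoostingRegressor(
    n_estimators=300,      # M, the number of boosting rounds / trees
    learning_rate=0.05,    # shrinkage weight applied to each tree
    max_depth=3,           # shallow trees: high bias, low variance
    loss="squared_error",
    random_state=0,
)
gbrt.fit(X, y)
print(gbrt.predict(X[:3]))
```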

K-nearest neighbor (KNN)

KNN is one of the most basic regression algorithms: the predicted value of a data point is the average of the values of its k nearest samples. The number of neighbours k and the distance metric are the two factors that most influence KNN. k is usually an integer no larger than 20, and the distance is calculated as the Euclidean distance, defined as

$$\begin{array}{c}d=\sqrt{{\sum }_{i=0}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}}\end{array}$$ (4)

where n is the number of features (dimensions) and xi, yi are the coordinates of the two points.
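
A minimal scikit-learn sketch with synthetic data; k = 5 is an illustrative choice:

```python
# Sketch: k-nearest-neighbour regression with Euclidean distance; the predicted
# value is the mean of the targets of the k nearest training samples.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn.fit(X, y)
print(knn.predict(X[:3]))
```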

Support vector machine (SVM)

SVM, also known as the "support vector network", is a kernel-based supervised learning algorithm developed from the Vapnik-Chervonenkis theory. For regression problems, the SVM uses a kernel function to project the input data into a higher-dimensional space, computes a hyperplane there, and fits the training data to that hyperplane. A linear kernel was used in this work.
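
A minimal sketch of linear-kernel support vector regression with scikit-learn; the feature-scaling step is a common practical addition, not taken from the paper:

```python
# Sketch: support vector regression with a linear kernel.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

svr = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1))
svr.fit(X, y)
print(svr.predict(X[:3]))
```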

Extreme gradient boosting (XGBoost)

XGBoost is an improved implementation of the GBRT algorithm that markedly increases data-processing efficiency and lowers the risk of overfitting. It is more scalable than GBRT because it employs a sparsity-aware algorithm for sparse data and uses both first- and second-order derivatives of the loss function during training. Like GBRT, XGBoost uses a forward stagewise algorithm and chooses the parameters of the next decision tree by minimizing the structural risk:

$$\begin{array}{c}{\widehat{{\Theta }}}_{m}=arg\underset{{{\Theta }}_{m}}{\text{min}}\left[\sum _{i=1}^{N}L\left({y}_{i},{f}_{\left(m-1\right)}\left({X}_{i}\right)+T\left({X}_{i};{{\Theta }}_{m}\right)\right)+{\Omega }\left(T\left(\cdot ;{{\Theta }}_{m}\right)\right)\right]\end{array}$$ (5)

where \({\Omega }\left(T\left(\cdot ;{{\Theta }}_{m}\right)\right)\) represents the regularisation term of the regression tree, which is an important difference between XGBoost and GBRT. XGBoost and GBRT otherwise use similar hyperparameters.
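
A minimal sketch using the xgboost Python package; the regularisation settings correspond to the Ω term above but are illustrative values:

```python
# Sketch: XGBoost regression; reg_lambda / reg_alpha control the regularisation
# of the leaf weights (the Omega term discussed above).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    reg_lambda=1.0,    # L2 penalty on leaf weights
    reg_alpha=0.0,     # L1 penalty on leaf weights
    random_state=0,
)
xgb.fit(X, y)
print(xgb.predict(X[:3]))
```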

Multilayer perceptron (MLP)

An MLP is a feed-forward artificial neural network that maps a set of input vectors to a set of output vectors. MLPs are typically trained with the backpropagation algorithm, a supervised learning technique, and mimic the way the human nervous system learns and predicts from data: the network stores information in its weights and adjusts those weights during training to reduce the difference between predicted and actual values. An MLP consists of three kinds of layers: input, hidden, and output. Each layer is made up of a number of nodes (neurons with non-linear activation functions) and is fully connected to the preceding layer. The input layer receives the data, the hidden layers process it, and the output layer provides the final prediction. The output of a single network layer can be written as

$$\begin{array}{c}f\left(x\right)=f\left(\sum _{i}^{M}{\omega }_{i}{x}_{i}+b\right)\end{array}$$ (6)

where x represents the input to the node, ω the node's weight, b the bias, and f the activation function. If every neuron's activation were linear, an MLP with multiple layers would be equivalent to a single-layer network. The rectified linear unit (ReLU) was used as the non-linear activation function in this work.
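
A minimal sketch using scikit-learn's MLPRegressor with ReLU activations; the layer sizes are assumptions:

```python
# Sketch: a small fully connected network trained by backpropagation.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

mlp = MLPRegressor(hidden_layer_sizes=(128, 64), activation="relu",
                   solver="adam", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X[:3]))
```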

Convolution neural network (CNN)

A convolutional neural network is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding units within their receptive field. A CNN comprises an input layer, hidden layers, and an output layer, with the hidden layers containing several kinds of sub-layers such as convolutional, pooling, fully connected (similar to a classical neural network), and normalization layers. The convolutional layer is the core of the CNN: it computes the dot product (usually the Frobenius inner product) of the convolutional kernel with the layer's input matrix, followed by a ReLU activation. As the kernel slides along the input matrix, the convolution operation produces a feature map, which in turn becomes part of the input to the next layer. CNNs are attractive deep learning architectures because they require fewer parameters than other deep neural networks.
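
The architecture used in this work is not detailed here; as a minimal sketch, a 1D convolutional network in PyTorch operating on a 2048-bit fingerprint could look like this (layer sizes are assumptions):

```python
# Sketch: a minimal 1D CNN treating a 2048-bit fingerprint as a 1-channel
# sequence; convolution + ReLU + pooling, then fully connected layers.
import torch
import torch.nn as nn

class FingerprintCNN(nn.Module):
    def __init__(self, n_bits: int = 2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),   # convolutional layer
            nn.ReLU(),
            nn.MaxPool1d(4),                              # pooling layer
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_bits // 16), 64),           # fully connected layers
            nn.ReLU(),
            nn.Linear(64, 1),                             # regression output
        )

    def forward(self, x):                                 # x: (batch, 1, n_bits)
        return self.head(self.features(x)).squeeze(-1)

model = FingerprintCNN()
dummy = torch.randn(8, 1, 2048)
print(model(dummy).shape)                                 # torch.Size([8])
```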

Metrics

The mean absolute error (MAE) over the n samples is given by

$$\begin{array}{c}MAE=\frac{1}{n}\sum _{i=1}^{n}\left|{y}_{true}^{\left(i\right)}-{y}_{pred}^{\left(i\right)}\right|\end{array}$$ (7)

The root mean squared error (RMSE) over the n samples is given by

$$\begin{array}{c}RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{true}^{\left(i\right)}-{y}_{pred}^{\left(i\right)}\right)}^{2}}\end{array}$$ (8)

The coefficient of determination (R2) over the n samples is given by

$$\begin{array}{c}{R}^{2}=1-\frac{\sum _{i=1}^{n}{({y}_{true}^{\left(i\right)}-{y}_{pred}^{\left(i\right)})}^{2}}{\sum _{i=1}^{n}{({y}_{true}^{\left(i\right)}-\frac{1}{n}{\sum }_{j=1}^{n}{y}_{true}^{\left(j\right)})}^{2}}\end{array}$$ (9)
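
These metrics can be computed directly with scikit-learn, as in the following sketch with illustrative arrays:

```python
# Sketch: MAE, RMSE and R^2 via scikit-learn; y_true / y_pred are toy values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)
print(mae, rmse, r2)
```

Hyperparameters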

We employed Bayesian optimization to identify each model's optimal hyperparameters during training[54]. This step is important because properly tuned hyperparameters have been shown to yield more accurate predictions than hand-picked ones.
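
A minimal sketch of such a search, assuming the scikit-optimize (skopt) implementation of Bayesian optimization; the actual optimiser, search spaces, and iteration budget used in this work may differ:

```python
# Sketch: Bayesian hyperparameter search with scikit-optimize's BayesSearchCV;
# the search space and iteration budget are illustrative assumptions.
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = BayesSearchCV(
    RandomForestRegressor(random_state=0),
    {
        "n_estimators": Integer(50, 500),
        "max_depth": Integer(2, 20),
        "max_features": Real(0.1, 1.0),
    },
    n_iter=30,
    cv=10,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```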

10-fold cross-validation

The data were randomly divided into ten equally sized, mutually exclusive subsets, each keeping the data distribution as consistent as possible. In each round, nine subsets were used as the training set and the remaining one as the test set, yielding ten training/test splits; the final result is the mean of the ten test outcomes.
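
A minimal scikit-learn sketch of this procedure with synthetic data:

```python
# Sketch: 10-fold cross-validation; the reported score is the mean over the
# ten held-out folds.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 196))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=cv, scoring="neg_mean_absolute_error")
print(-scores.mean())   # mean MAE across the ten folds
```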


